Extracting Attributes and Their Values from Web Pages

Author

  • Minoru YOSHIDA
Abstract

We propose a method for extracting attributes and their values from Web pages. Our method uses word distributions estimated from plain Web pages. The key idea is to estimate these word distributions by consulting ontologies built from HTML tables. In a series of experiments, we show that the estimated word distributions are useful for extracting attributes and their values from various kinds of HTML representations other than tables.
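
As a rough illustration of the idea in the abstract, the sketch below assumes a toy "ontology" given as two word lists (attribute words and value words, as might be collected from HTML tables). It estimates a smoothed probability that each word plays the attribute role, then uses that estimate to decide which side of a non-table line such as "Hobby: tennis" is the attribute. The function names (`estimate_word_distribution`, `attribute_score`, `extract_pair`), the smoothing constant, and the sample data are all hypothetical; this is not the paper's actual algorithm.

```python
from collections import Counter
import math


def estimate_word_distribution(attribute_words, value_words, alpha=1.0):
    """Return p_attr(word): a smoothed estimate of P(role = attribute | word).

    Assumption: the two word lists stand in for an ontology built from
    HTML tables. `alpha` is add-alpha smoothing, so unseen words get 0.5.
    """
    attr_counts = Counter(w.lower() for w in attribute_words)
    val_counts = Counter(w.lower() for w in value_words)

    def p_attr(word):
        w = word.lower()
        a = attr_counts[w] + alpha
        v = val_counts[w] + alpha
        return a / (a + v)

    return p_attr


def attribute_score(text, p_attr):
    """Mean log-odds that the words in `text` are attribute words."""
    words = text.split()
    if not words:
        return 0.0
    total = sum(math.log(p_attr(w) / (1.0 - p_attr(w))) for w in words)
    return total / len(words)


def extract_pair(line, p_attr, sep=":"):
    """Split a text line at `sep` and return (attribute, value), or None.

    The side with the higher attribute score is taken as the attribute,
    so the pair is recovered even when the order is reversed.
    """
    if sep not in line:
        return None
    left, right = (part.strip() for part in line.split(sep, 1))
    if not left or not right:
        return None
    if attribute_score(left, p_attr) >= attribute_score(right, p_attr):
        return left, right
    return right, left


# Hypothetical ontology words, for illustration only.
attribute_words = ["name", "age", "hobby", "name", "birthplace"]
value_words = ["tokyo", "tennis", "minoru", "reading"]
p_attr = estimate_word_distribution(attribute_words, value_words)

pair_a = extract_pair("Hobby: tennis", p_attr)
pair_b = extract_pair("Tokyo : Birthplace", p_attr)
pair_c = extract_pair("no separator here", p_attr)
```

Here `pair_a` comes out as `("Hobby", "tennis")`, `pair_b` as `("Birthplace", "Tokyo")`, and `pair_c` as `None`. A colon-separated line is only one of many possible non-table representations; it was chosen to keep the sketch short.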


Similar resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...


Information Extraction from Hypertext Mark-Up Language Web Pages

Problem statement: Nowadays, many users rely on web search engines to find and gather information. Users face an increasing number of HTML information sources, so correlating, integrating, and presenting related information to users becomes important. When a user uses a search engine such as Yahoo and Google to seek specific information, the results are not only information about ...


The WDC Gold Standards for Product Feature Extraction and Product Matching

Finding out which e-shops offer a specific product is a central challenge for building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching function. The existing gold standards for prod...


Examining the relationship between information quality and appearance indicators in Persian Web pages related to public health

Introduction: One approach to evaluate the quality of a web page is to investigate its external markers. The purpose of the present study is to determine the relationship between information quality of Persian public health web pages and their external quality. Methods: The samples of this correlation study were selected from among the freely available ten-key word texts of chronic diseases...


Finding and Extracting Data Records from Web Pages

Many HTML pages are generated by software programs by querying some underlying databases and then filling in a template with the data. In these situations the metainformation about the data structure is lost, so automated software programs cannot process these data in such powerful manners as information from databases. We propose a set of novel techniques for detecting structured records in a ...




Publication date: 2002